zimmerrol/emomatch-pytorch
EmoMatch Task

Simple transfer-learning task based on the VoxCeleb dataset to pretrain networks that operate on videos (audio + video). This code requires you to download the VoxCeleb dataset and to extract it (both the audio and the video tracks).

The idea of this approach is based on the paper Look, Listen, Learn: there, audio and video information were used to pretrain an image encoder network for use in image classification tasks.

This project tries to extend this approach: instead of training only an image encoder, it pre-trains a network that can process both audio and video information. The task the network is meant to solve is rather simple: given an audio sequence and a video sequence, decide whether the two match (i.e. have the same origin).
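The matching task above amounts to binary classification over (audio, video) pairs. A minimal sketch of how such training pairs could be constructed, assuming clips are identified by recording IDs; `make_pairs` is a hypothetical helper for illustration, not part of this repository:

```python
import random

def make_pairs(clip_ids, mismatch_ratio=0.5, seed=0):
    """Build (audio_id, video_id, label) training pairs.

    label 1 -> audio and video come from the same recording (Match),
    label 0 -> the audio track is swapped in from a different
               recording (No Match).
    """
    rng = random.Random(seed)
    pairs = []
    for vid in clip_ids:
        if rng.random() < mismatch_ratio and len(clip_ids) > 1:
            # Draw an audio track from a *different* recording.
            aud = rng.choice([c for c in clip_ids if c != vid])
            pairs.append((aud, vid, 0))
        else:
            pairs.append((vid, vid, 1))
    return pairs

pairs = make_pairs(["clip_a", "clip_b", "clip_c"])
```

Each video contributes one pair; roughly half of them get a foreign audio track, which gives the classifier balanced positive and negative examples.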

The structure of the EmoMatch training procedure is shown in the image below. The left side shows the data preparation, while the right side illustrates the data flow through the network. During data preparation, each video recording is split into its video and audio tracks. These tracks are then fed into a VNet and an ANet for the video and the audio, respectively. These networks serve as encoders that generate features for a classifier network. The classifier then detects whether the audio track originates from the same recording as the video track (Match) or from a different recording (No Match).
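The data flow described above can be sketched in PyTorch. The names ANet and VNet come from the diagram, but the layer sizes and internals below are illustrative assumptions, not the repository's actual architecture:

```python
import torch
import torch.nn as nn

class ANet(nn.Module):
    """Hypothetical audio encoder: spectrogram -> feature vector."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(16, feat_dim)

    def forward(self, x):  # x: (B, 1, freq, time)
        return self.fc(self.conv(x).flatten(1))

class VNet(nn.Module):
    """Hypothetical video encoder: frame stack -> feature vector."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(16, feat_dim)

    def forward(self, x):  # x: (B, 3, frames, H, W)
        return self.fc(self.conv(x).flatten(1))

class EmoMatch(nn.Module):
    """Concatenate audio/video features, classify Match vs. No Match."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.anet = ANet(feat_dim)
        self.vnet = VNet(feat_dim)
        self.head = nn.Sequential(
            nn.Linear(2 * feat_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, audio, video):
        feats = torch.cat([self.anet(audio), self.vnet(video)], dim=1)
        return self.head(feats)  # logit; > 0 means Match

model = EmoMatch()
logit = model(torch.randn(2, 1, 64, 100), torch.randn(2, 3, 8, 32, 32))
```

The classifier head sees only the concatenated encoder features, so after pretraining the ANet and VNet can be detached and reused as encoders for downstream audio or video tasks.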

About

Unsupervised Audio + Video Network Pretraining using PyTorch
